Abstract. Hashtags are semantico-syntactic constructs used across various
social networking and microblogging platforms to enable users to start a
topic-specific discussion or classify a post into a desired category.
Segmenting and linking the entities present within hashtags could therefore
help in better understanding and extracting the information shared across
social media. However, due to the lack of space delimiters in hashtags
(e.g. #nsavssnowden), segmenting a hashtag into its constituent entities
(“NSA” and “Edward Snowden” in this case) is not a trivial task. Most
current state-of-the-art social media analytics systems, such as those for
Sentiment Analysis and Entity Linking, tend to either ignore hashtags or
treat them as a single word. In this paper, we present a context-aware
approach to segment the entities in a hashtag and link them to knowledge
base (KB) entries, based on the context within the tweet. Our approach
segments and links the entities in hashtags such that the coherence between
hashtag semantics and the tweet is maximized. To the best of our knowledge,
no existing study addresses the issue of linking entities in hashtags for
extracting semantic information. We evaluate our method on two different
datasets, and demonstrate the effectiveness of our technique in improving
the overall entity linking in tweets via the additional semantic information
provided by segmenting and linking entities in a hashtag.

Keywords: Hashtag Segmentation, Entity Linking, Entity Disambiguation,
Information Extraction


1 Introduction 

Microblogging and Social Networking websites like Twitter, Google+, Facebook
and Instagram are becoming increasingly popular, with more than 400 million
posts each day. This huge collection of posts on social media makes it an
important source for gathering real-time news and event information.
Microblog posts are often tagged with an unspaced phrase prefixed with the
“#” sign, known as a hashtag. 14% of English tweets are tagged with at least
one hashtag, with 1.4 hashtags per tweet [1]. Hashtags make it possible to
categorize and track a microblog post among millions of other posts. Semantic
analysis of hashtags could therefore help us in understanding and extracting
important information from microblog posts.
In English, and many other Latin alphabet based languages, the inherent
structure of the language imposes an assumption under which the space
character is a good approximation of a word delimiter. However, hashtags
violate this assumption, making them difficult to analyse. In this paper, we
analyse the problem of extracting semantics from hashtags by segmenting and
linking the entities within them. For example, given a hashtag like
“#NSAvsSnowden” occurring inside a tweet, we develop a system that not only
segments the hashtag into “NSA vs Snowden”, but also tells that “NSA” refers
to “National Security Agency” and “Snowden” refers to “Edward Snowden”. Such
a system has numerous applications in the areas of Sentiment Analysis,
Opinion Mining, Event Detection and improving the quality of search results
on Social Networks, as these systems can leverage the additional semantic
information provided by the hashtags present within the tweets. Our system
takes a hashtag and the corresponding tweet text as input and returns the
segmented hashtag along with the Wikipedia pages corresponding to the
entities in the hashtag. To the best of our knowledge, the proposed system is
the first to focus on extracting semantic knowledge from hashtags by
segmenting them into constituent entities.


2 Related Work 

The problem of word segmentation has been studied in various contexts in the
past. A lot of work has been done on Chinese word segmentation. Huang et
al. [3] showed that a character-based tagging approach outperforms other
word-based segmentation approaches for Chinese word segmentation. English URL
segmentation has also been explored by various researchers in the past.
All such systems explored length-specific features to segment the URLs into
constituent chunks.¹ Although a given hashtag can be segmented into various
possible segments, all of which are plausible, the “correct” segmentation
depends on the tweet context. For example, consider the hashtag “notacon”. It
can be segmented into the chunks ‘not, a, con’ or ‘nota, con’ depending on
the tweet context. The proposed system focuses on hashtag segmentation while
being context aware. Along with unigram, bigram and domain-specific features,
the content of the tweet text is also considered for segmenting and linking
the entities within a hashtag.

Entity linking in microposts has also been studied by various researchers
recently. Various features like commonness, relatedness, popularity and
recentness have been used for detecting and linking the entities in
microposts. Although semantic analysis of microposts has been studied
extensively, hashtags are either ignored or treated as a single word. In this
work, we analyse hashtags by linking the entities in the hashtags to the
corresponding Wikipedia pages.


1 The term “chunk” here and henceforth refers to each of the segments $s_i$
in a segmentation $S = s_1, s_2, \ldots, s_i, \ldots, s_n$. For example, in
the case of the hashtag #NSAvsSnowden, one of the possible segmentations (S)
is NSA, vs, Snowden. Here, the words “NSA”, “vs” and “Snowden” are being
referred to as chunks.
3 System Architecture

In this section, we present an overview of our system. We also describe the
features extracted, followed by a discussion on training and learning
procedures in Section 4.

As illustrated in Fig. 1, the proposed system has 3 major components: 1)
Hashtag Segmentations Seeder, 2) Feature Extraction and Entity Linking
module, and 3) Segmentation Ranker. In the following sections, we describe
each component in detail.



Fig. 1: Schematic Diagram of the Overall System. 


3.1 Hashtag Segmentations Seeder 

The Hashtag Segmentations Seeder is responsible for generating a list of
possible segmentations of a given hashtag. As the first step, we propose a
Variable Length Sliding Window technique that generates a set of highly
probable segmentations for the given hashtag.

The Variable Length Sliding Window technique is based on the assumption
that for a given hashtag “#AXB”, if A and B are valid semantic units (a
single word or a collection of words concatenated together without a space),
it is reasonable to hypothesize that X is also a valid semantic unit. For
example, in the hashtag “#followUCBleague”, since ‘follow’ and ‘league’ are
well-known dictionary words, and collectively this hashtag has some semantic
meaning associated with it as it has occurred in a tweet, it is reasonable to
assume that ‘UCB’ is also a valid semantic unit with some meaning associated
with it. The length of the sliding window (X) is varied from MIN_LEN to
MAX_LEN with each iteration, and the window is slid over the hashtag.
$O(n^2)$ triplets of the form (A, X, B) are generated using the sliding
window technique, where n is the length of the hashtag, X is the part of the
hashtag lying within the window and A and B are the parts of the hashtag (of
length > 0) that lie on the left and right of the window respectively.
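
The triplet generation described above can be sketched in a few lines of
Python. This is a minimal illustration rather than the original
implementation: the function name `sliding_window_triplets` is hypothetical,
the window bounds 2 and 6 are the values chosen later in this section, and A
and B are required to be non-empty, following the "length > 0" condition as
printed.

```python
MIN_LEN, MAX_LEN = 2, 6  # window bounds selected later in this section


def sliding_window_triplets(hashtag, min_len=MIN_LEN, max_len=MAX_LEN):
    """Return every (A, X, B) split with non-empty A and B (an assumption)
    and min_len <= len(X) <= max_len."""
    text = hashtag.lstrip("#")
    triplets = []
    for width in range(min_len, max_len + 1):
        # start >= 1 keeps A non-empty; the upper bound keeps B non-empty.
        for start in range(1, len(text) - width):
            a = text[:start]
            x = text[start:start + width]
            b = text[start + width:]
            triplets.append((a, x, b))
    return triplets


triplets = sliding_window_triplets("#followUCBleague")
print(("follow", "UCB", "league") in triplets)
```

Each triplet is later scored, so generating some implausible windows at this
stage is harmless; the scoring step is what filters them.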


































Piyush Bansal, Romil Bansal, Vasudeva Varma 


Each segment A and B of the triplet (A, X, B) is assigned a score according
to the classical Dynamic Programming based algorithm for Word Segmentation
[7], hereafter referred to as ViterbiWordSeg.

ViterbiWordSeg takes a string as input and returns the best possible
segmentation BestSeg (an ordered collection of chunks) for that string. The
score assigned to the segmentation by ViterbiWordSeg is the sum of the log
probability scores of the segmented chunks under the unigram language model.

$$\mathrm{ViterbiWordSegScore}(S) = \sum_{s_i \in \mathrm{BestSeg}(S)} \log\left(P_{\mathrm{Unigram}}(s_i)\right) \tag{1}$$
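
To make the dynamic programme concrete, the following is a minimal sketch of
a unigram word segmenter in the spirit of ViterbiWordSeg. It is not the
implementation of [7]: the probability table `TOY_UNIGRAM`, the floor
probability for unseen words and the maximum word length are all hypothetical
stand-ins for a real unigram language model.

```python
import math
from functools import lru_cache

# Toy unigram probabilities (illustrative values, not from any real corpus).
TOY_UNIGRAM = {"follow": 0.02, "league": 0.01, "fol": 0.0001, "low": 0.005}
FLOOR_PROB = 1e-6   # assumed penalty base for unseen words
MAX_WORD_LEN = 20   # assumed upper bound on chunk length


def unigram_log_prob(word):
    """Log-probability of a chunk; unseen chunks are penalised per character."""
    if word in TOY_UNIGRAM:
        return math.log(TOY_UNIGRAM[word])
    return math.log(FLOOR_PROB) * len(word)


def viterbi_word_seg(text):
    """Return (score, chunks) maximising the sum of unigram log-probabilities."""
    @lru_cache(maxsize=None)
    def best(start):
        if start == len(text):
            return 0.0, ()
        candidates = []
        last = min(len(text), start + MAX_WORD_LEN)
        for end in range(start + 1, last + 1):
            word = text[start:end]
            tail_score, tail_chunks = best(end)
            candidates.append((unigram_log_prob(word) + tail_score,
                               (word,) + tail_chunks))
        return max(candidates)

    return best(0)


score, chunks = viterbi_word_seg("followleague")
print(chunks, round(score, 3))
```

The memoised helper `best(start)` solves each suffix once, so the running
time is bounded by the string length times `MAX_WORD_LEN`.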


We used the Microsoft Web N-Gram Service² for computing the unigram
probability scores. The aforementioned corpus contains data from the web, and
hence various acronyms and slang words occur in it. This holds critical
importance in the context of our task. Next, for each triplet of the form
(A, X, B), we compute the Sliding Window score as follows.

$$\mathrm{Score}_{\mathrm{SlidingWindow}}(A, X, B) = \mathrm{ViterbiWordSegScore}(A) + \mathrm{constant} \cdot \log_{10}(\mathrm{UnigramProb}(X)) \cdot \mathrm{WordLenProb}(\mathrm{len}(X)) + \mathrm{ViterbiWordSegScore}(B) \tag{2}$$

where WordLenProb(x) is the ordinate value at x in Figure 2 and the
constant is set by experimentation.

Also, for each triplet (A, X, B), the final segmentation Seg(A, X, B) is the
ordered collection of chunks (BestSeg(A), X, BestSeg(B)), where BestSeg(A)
and BestSeg(B) refer to the best segmentations (ordered collections of
chunks) returned by ViterbiWordSeg(A) and ViterbiWordSeg(B) respectively.
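
The scoring and assembly of a triplet can be sketched as below. This is an
illustration under stated assumptions, not the original code: the three
scoring components are passed in as callables, the toy stand-ins and their
numeric values are invented for the example, and the logarithm in the window
term is read as base 10.

```python
import math


def sliding_window_score(a, x, b, seg_score, unigram_prob, word_len_prob,
                         constant=1.0):
    """Equation 2: segment scores of A and B plus the weighted window term."""
    window_term = (constant * math.log10(unigram_prob(x))
                   * word_len_prob(len(x)))
    return seg_score(a) + window_term + seg_score(b)


def assemble_segmentation(a, x, b, best_seg):
    """Seg(A, X, B): chunks of BestSeg(A), then X, then chunks of BestSeg(B)."""
    return tuple(best_seg(a)) + (x,) + tuple(best_seg(b))


# Hypothetical stand-ins for ViterbiWordSeg, the unigram model and Figure 2.
def toy_seg_score(text):
    return -2.0 * len(text)


def toy_best_seg(text):
    return (text,)


def toy_unigram_prob(word):
    return 0.01


def toy_word_len_prob(length):
    return {2: 0.15, 3: 0.20}.get(length, 0.05)


example_score = sliding_window_score("follow", "UCB", "league", toy_seg_score,
                                     toy_unigram_prob, toy_word_len_prob,
                                     constant=2.0)
example_seg = assemble_segmentation("follow", "UCB", "league", toy_best_seg)
print(example_seg, example_score)
```

Passing the components as callables keeps the sketch independent of any
particular language model or length distribution.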



Fig. 2: Word Length vs. Frequency Percentage graph for 50M tweets.


2 Microsoft Web N-Gram Services:
http://research.microsoft.com/en-us/collaboration/focus/cs/web-ngram.aspx
To find suitable values for MIN_LEN and MAX_LEN, we plot frequency
percentage against word length using 50 million tweets.³ Figure 2 shows the
plot obtained.

It is observed that 79% of the tweet words have a length between 2 and 6.
Hence, we set MIN_LEN and MAX_LEN to 2 and 6 respectively.

The major benefit of this technique is that we are able to handle named
entities and out-of-vocabulary (OOV) words. This is achieved by assigning the
score as a function of WordLenProb and the smoothed backoff unigram
probability (Equation 2) for words within the window.

Now that we have a list of $O(n^2)$ segmentations and their corresponding
$\mathrm{Score}_{\mathrm{SlidingWindow}}$, we pick the top k segmentations
for each hashtag on the basis of this score. We set k = 20, as precision at
20 (P@20) comes out to be around 98%. This establishes that the subset of
segmentations we seed, which is of size $O(n^2)$, indeed contains highly
probable segmentations out of a total of $2^{n-1}$ possible segmentations.⁴

3.2 Feature Extraction and Entity Linking 

This component of the system is responsible for two major tasks: feature
extraction from each of the seeded segmentations, and entity linking on the
segmentations. The features, as also shown in the System Diagram, are 1)
Unigram Score, 2) Bigram Score, 3) Context Score, 4) Capitalisation Score,
and 5) Relatedness Score. The first feature, Unigram Score, is essentially
the ViterbiWordSegScore computed in the previous step. In the following
sections, we describe the rest of the features.

Bigram Score: For each of the segmentations seeded by the Variable Length
Sliding Window Technique, a bigram based score is computed using the
Microsoft Web N-Gram Services. It is possible for a hashtag to have two
perfectly valid segmentations. Consider the hashtag #Homesandgardens. This
hashtag can be split as “Homes and gardens”, which seems more probable to
occur in a given context than “Home sand gardens”. Bigram based scoring helps
to rank such segmentations, so that higher scores are awarded to the more
semantically “appealing” segmentations. The bigram language model would score
one of the above segmentations, “Homes and gardens”, as

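$$P(\text{Homes}, \text{and}, \text{gardens}) \approx P(\text{Homes} \mid \langle s \rangle) \cdot P(\text{and} \mid \text{Homes}) \cdot P(\text{gardens} \mid \text{and}) \cdot P(\langle /s \rangle \mid \text{gardens}) \tag{3}$$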
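
The bigram factorisation in Equation 3 can be illustrated with a small
sketch. The conditional probabilities in `TOY_BIGRAM` and the back-off value
`FLOOR` are invented for the example; a real system would query a bigram
language model such as the Web N-Gram service instead.

```python
import math

# Hypothetical values of P(word | previous word), for illustration only.
TOY_BIGRAM = {
    ("<s>", "homes"): 0.001,
    ("homes", "and"): 0.2,
    ("and", "gardens"): 0.01,
    ("gardens", "</s>"): 0.1,
    ("<s>", "home"): 0.005,
    ("home", "sand"): 0.0001,
    ("sand", "gardens"): 0.0005,
}
FLOOR = 1e-9  # assumed probability for bigrams missing from the table


def bigram_log_score(chunks):
    """Log of the chain-rule product over consecutive chunk pairs,
    including the sentence start and end markers."""
    tokens = ["<s>"] + [chunk.lower() for chunk in chunks] + ["</s>"]
    return sum(math.log(TOY_BIGRAM.get(pair, FLOOR))
               for pair in zip(tokens, tokens[1:]))


good = bigram_log_score(["Homes", "and", "gardens"])
odd = bigram_log_score(["Home", "sand", "gardens"])
print(good > odd)
```

Working in log space turns the product of Equation 3 into a sum, which
avoids numerical underflow for longer segmentations.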

3 The dataset is available at
http://demeter.inf.ed.ac.uk/cross/docs/fsd_corpus.tar.gz

4 For a string made up of n characters, we need to decide where to put the
spaces so that we can get a sequence of valid words. There are n - 1
positions where a space can be placed, and each position may or may not have
a space. Hence there are $2^{n-1}$ segmentations.
Context Score: The context based score is an important feature. It is
responsible for bubbling up the segmentations with maximum contextual
similarity to the tweet content. Using the CMU TweetNLP toolkit [8], words
having POS tags like verb, noun and adjective are extracted both from the
candidate segmentation of the hashtag and from the tweet context, i.e. the
text of the tweet other than the hashtag. Next, Wu-Palmer similarity from
WordNet [5] is used on these two sets of words to find how similar a
candidate segmentation is to the tweet context. These scores are normalized
to lie between 0 and 1.

Capitalisation Score: Hashtags are of varied nature. Some hashtags have a
camelcase-like capitalisation pattern, as in #HomesAndGardens, while others
have everything in lowercase or uppercase characters, like #homesandgardens.
However, we can easily see that camelcase conveys more information, as it
helps segment the hashtag into “Homes and gardens” and not “Home sAnd
Gardens”. The capitalisation score helps us to capture the information
conveyed by capitalisation patterns within the hashtags. We use the following
two rules. For a hashtag,

- If a set of characters occurring together are in capitals, as in
  #followUCBleague, they are considered to be a part of an “assumed cluster”
  (“UCB” in this case).

- If it has a few capital letters separated by a group of lower case letters,
  as in #SomethingGood, we assume the capital letters are delimiters and
  hence derive a few assumed clusters from the input hashtag.

We calculate the capitalisation score for a given segmentation S containing
chunks $s_1, s_2, \ldots, s_i, \ldots, s_n$ as

$$\mathrm{Score}_{cap} = \sum_{i=1}^{n} \mathrm{assumedClusterNotIntact}(s_i) \tag{4}$$

where $\mathrm{assumedClusterNotIntact}(s_i)$ returns 1 if $s_i$ fails to
keep an assumed cluster intact, and 0 otherwise.
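
One possible reading of the two rules and of Equation 4 is sketched below.
The regular expression and the interpretation of "fails to keep an assumed
cluster intact" (a chunk overlaps a cluster without fully containing it) are
assumptions made for this illustration, and the function names are
hypothetical.

```python
import re

# Rule 1: a run of two or more capitals forms a cluster (e.g. "UCB").
# Rule 2: a capital followed by lower-case letters forms a cluster
#         (e.g. "Something", "Good").
CLUSTER_RE = re.compile(r"[A-Z]{2,}|[A-Z][a-z]+")


def assumed_clusters(hashtag):
    """Return (start, end) character spans of the assumed clusters."""
    return [m.span() for m in CLUSTER_RE.finditer(hashtag.lstrip("#"))]


def capitalisation_score(hashtag, chunks):
    """Equation 4: count the chunks that cut through an assumed cluster."""
    clusters = assumed_clusters(hashtag)
    score, start = 0, 0
    for chunk in chunks:
        end = start + len(chunk)
        for c_start, c_end in clusters:
            overlaps = start < c_end and c_start < end
            contains = start <= c_start and c_end <= end
            if overlaps and not contains:
                score += 1
                break  # each chunk contributes at most 1
        start = end
    return score


print(capitalisation_score("#followUCBleague", ["follow", "UCB", "league"]))
print(capitalisation_score("#HomesAndGardens", ["Home", "sAnd", "Gardens"]))
```

With this reading a lower score is better, since every counted chunk breaks
a boundary suggested by the capitalisation pattern. The simple pattern above
does not separate an acronym from a directly following capitalised word.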

Relatedness Score: The relatedness score measures the coherence between the
tweet context and the hashtag segmentation. This score is computed on the
basis of the semantic relatedness between the entities present within the
segmented hashtag and the tweet context.

We calculated the relatedness between all the possible mentions in the
segmented hashtag ($M_H$) and all other possible mentions in the tweet
context ($M_T$). For computing the relatedness between two entities, we used
the Wikipedia-based relatedness function proposed by Milne and Witten.

Relatedness between two Wikipedia pages $p_a$ and $p_b$ is defined as follows:

$$rel(p_a, p_b) = 1 - \delta \tag{5}$$

where

$$\delta = \frac{\log(\max(|in(p_a)|, |in(p_b)|)) - \log(|in(p_a) \cap in(p_b)|)}{\log(W) - \log(\min(|in(p_a)|, |in(p_b)|))} \tag{6}$$
$in(p_a)$ is the set of Wikipedia pages pointing to page $p_a$ and W is the
total number of pages in Wikipedia.
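
The link-based relatedness of Equations 5 and 6 can be computed directly from
sets of in-links, as the following sketch shows. The handling of pages with
no shared in-links and the clipping of negative values to 0 are assumptions
added for the example; they are not specified in the text.

```python
import math


def relatedness(in_a, in_b, total_pages):
    """rel(p_a, p_b) = 1 - delta, where in_a and in_b are the sets of pages
    linking to p_a and p_b, and total_pages is W."""
    common = len(in_a & in_b)
    if common == 0:
        return 0.0  # assumption: no shared in-links means unrelated
    larger = max(len(in_a), len(in_b))
    smaller = min(len(in_a), len(in_b))
    delta = ((math.log(larger) - math.log(common))
             / (math.log(total_pages) - math.log(smaller)))
    return max(0.0, 1.0 - delta)  # assumption: clip to keep the score >= 0


# Toy in-link sets identified by page ids (illustrative only).
print(relatedness({1, 2, 3, 4}, {3, 4, 5, 6}, total_pages=100))
```

Two pages with identical in-link sets obtain the maximum relatedness of 1,
and the score falls as the shared in-links shrink relative to the larger set.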

The overall vote given to a candidate page $p_a$ for a given mention a by a
mention b is defined as

$$vote_b(p_a) = \frac{\sum_{p_b \in Pg(b)} rel(p_b, p_a) \cdot Pr(p_b \mid b)}{|Pg(b)|} \tag{7}$$

where $Pg(b)$ is the set of all possible candidate pages for the mention b
and $Pr(p_b \mid b)$ is the prior probability of b linking to a page $p_b$.

The total relatedness score given to a candidate page $p_a$ for a given
mention a is the sum of votes from all other mentions in the tweet context
($M_T$):

$$rel_a(p_a) = \sum_{b \in M_T} vote_b(p_a) \tag{8}$$


Now the overall relatedness score for a given hashtag segmentation h is

$$score_h = \frac{\sum_{m \in M_H} rel_m(p_a) \cdot Pr(p_a \mid m)}{|M_H|} \tag{9}$$


The detected page $p_a$ for a given mention in the segmented hashtag is the
Wikipedia page with the highest $rel_a(p_a)$. Since not all the entities are
meaningful, we prune the entities with very low $rel_a(p_a)$ scores. In our
case, the threshold is set to 0.1. This disambiguation function is
considered state-of-the-art and has also been adopted by various other
systems. The relatedness score, $score_h$, is used as a feature for hashtag
segmentation. The entities in the segmented hashtag are returned along with
the score for further improving the hashtag semantics.
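
The voting scheme of Equations 7 to 9, together with the pruning threshold of
0.1, can be sketched as follows. The data layout (a dictionary from mention to
candidate pages with priors), the toy relatedness table and the second
candidate page are hypothetical. Whether pruned mentions still contribute to
$score_h$ is not stated in the text; this sketch assumes they do.

```python
def vote(page_a, mention_b, candidates, rel):
    """Equation 7: prior-weighted relatedness of mention_b's candidate pages
    to page_a, averaged over the number of candidates."""
    pages_b = candidates[mention_b]
    total = sum(rel(p_b, page_a) * prior for p_b, prior in pages_b.items())
    return total / len(pages_b)


def total_relatedness(page_a, context_mentions, candidates, rel):
    """Equation 8: sum of the votes from every mention in the tweet context."""
    return sum(vote(page_a, b, candidates, rel) for b in context_mentions)


def link_and_score(hashtag_mentions, context_mentions, candidates, rel,
                   threshold=0.1):
    """Pick the best page per hashtag mention, prune weak links and
    compute the segmentation score of Equation 9."""
    links, weighted_sum = {}, 0.0
    for m in hashtag_mentions:
        scored = {p: total_relatedness(p, context_mentions, candidates, rel)
                  for p in candidates[m]}
        best_page = max(scored, key=scored.get)
        weighted_sum += scored[best_page] * candidates[m][best_page]
        if scored[best_page] >= threshold:
            links[m] = best_page
    return links, weighted_sum / len(hashtag_mentions)


# Toy candidates with prior probabilities (illustrative values only).
CANDIDATES = {
    "nsa": {"National Security Agency": 0.9, "Other NSA page": 0.1},
    "snowden": {"Edward Snowden": 1.0},
}
TOY_REL = {
    frozenset({"National Security Agency", "Edward Snowden"}): 0.8,
    frozenset({"Other NSA page", "Edward Snowden"}): 0.05,
}


def toy_rel(p, q):
    return 1.0 if p == q else TOY_REL.get(frozenset({p, q}), 0.0)


links, score_h = link_and_score(["nsa"], ["snowden"], CANDIDATES, toy_rel)
print(links, round(score_h, 2))
```

In a full system `toy_rel` would be replaced by the in-link based relatedness
of Equations 5 and 6, and the priors by link statistics from Wikipedia.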


3.3 Segmentation Ranker 

This component of the system is responsible for ranking the various probable
segmentations seeded by the Hashtag Segmentations Seeder Module. We
generated five features for each segmentation using the Feature Extraction
and Entity Linking Module in the previous step. These scores are combined by
modelling the problem as a regression problem, and the combined score is
referred to as $Score_{Regression}$. The segmentations are ranked using
$Score_{Regression}$. In the end, the Segmentation Ranker outputs a ranked
list of segmentations along with the entity linkings.

In the next section, we discuss the regression and training procedures in 
greater detail. 


4 Training Procedure 

For the task of training the model, we consider the $Score_{Regression}$ of
all correct segmentations to be 1 and that of all incorrect segmentations to
be 0. Our feature vector comprises the five different scores calculated in
Section 3.2. We use linear regression with elastic net regularisation [14].
This allows us to learn a model that is trained with both L1 and L2 priors as
regularizers. It also helps us take care of the situation when some of the
features might be correlated with one another. Here, ρ controls the convex
combination of L1 and L2.

The objective function we try to minimize is

$$\frac{1}{2 n_{samples}} \lVert Xw - y \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2 \tag{10}$$

where X, y and w are the Model Matrix, Response Vector, and Coefficient
Matrix respectively. The parameters alpha (α) and rho (ρ) are set by cross
validation.
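
To show how α and ρ interact, the sketch below evaluates the elastic net
objective for a given weight vector in plain Python. It assumes the standard
elastic net form of Equation 10 (the same form minimised by scikit-learn's
`ElasticNet`, where ρ corresponds to `l1_ratio`); the function name and the
toy data are hypothetical.

```python
def elastic_net_objective(X, y, w, alpha, rho):
    """Value of the elastic net objective for feature rows X, targets y
    and weights w."""
    n_samples = len(X)
    residuals = [sum(x_j * w_j for x_j, w_j in zip(row, w)) - target
                 for row, target in zip(X, y)]
    squared_error = sum(r * r for r in residuals) / (2 * n_samples)
    l1_norm = sum(abs(w_j) for w_j in w)
    l2_norm_squared = sum(w_j * w_j for w_j in w)
    return (squared_error
            + alpha * rho * l1_norm
            + 0.5 * alpha * (1 - rho) * l2_norm_squared)


# Two toy samples with two features each; labels are 1 (correct) or 0.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, 0.0]
print(elastic_net_objective(X, y, [1.0, 0.0], alpha=0.1, rho=0.5))
```

Setting ρ = 1 leaves only the L1 penalty (lasso) and ρ = 0 only the L2
penalty (ridge); intermediate values mix the two.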


5 Experiments and Results 

In this section we describe the datasets used for evaluation, and establish the 
effectiveness of our technique by comparing our results to a well known end-to- 
end Entity Linking system, TAGME [13], which works on short texts, including 
tweets. 


5.1 Evaluation Metrics and Datasets 

This section is divided into two parts. First, we explain the evaluation metrics 
in the context of our experiments, and later, we discuss the datasets used for 
evaluation. 


Evaluation Metrics. We evaluated our system on two different metrics.
Firstly, the system is evaluated based on its performance in the
segmentation task. As the system returns a list of top-k hashtag
segmentations for a given hashtag, we evaluated the precision at n (P@n)
scores for the hashtag segmentation task. We also compared our P@1 score
with Word Breaker,⁵ which performs word segmentation. Secondly, the system is
also evaluated on the basis of its entity linking performance on the
hashtags. We computed Precision, Recall and F-Measure scores for the entities
linked in the top ranked hashtag. For the Entity Linking task, we used the
same notions of Precision, Recall, and F-Measure as proposed by Marco et al.
We compared our system with the state-of-the-art TAGME system.
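
The two kinds of metric can be sketched as follows. Both functions are
simplified readings made for illustration: P@n is taken to be the fraction of
hashtags for which a correct segmentation appears among the top n ranked
candidates, and Precision, Recall and F-Measure are computed over exact
matches of (mention, page) pairs, which may differ in detail from the notions
referenced above.

```python
def precision_at_n(ranked_lists, gold, n):
    """Fraction of hashtags whose top-n ranked segmentations include at
    least one segmentation marked correct in gold."""
    hits = sum(1 for tag, ranked in ranked_lists.items()
               if any(seg in gold[tag] for seg in ranked[:n]))
    return hits / len(ranked_lists)


def precision_recall_f(predicted, reference):
    """Set-based Precision, Recall and F-Measure over (mention, page) pairs."""
    true_pos = len(predicted & reference)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Toy ranked outputs and gold segmentations (illustrative only).
ranked = {
    "#homesandgardens": [("home", "sand", "gardens"),
                         ("homes", "and", "gardens")],
    "#nsavssnowden": [("nsa", "vs", "snowden")],
}
gold = {
    "#homesandgardens": {("homes", "and", "gardens")},
    "#nsavssnowden": {("nsa", "vs", "snowden")},
}
print(precision_at_n(ranked, gold, 1), precision_at_n(ranked, gold, 2))
```

Storing the gold segmentations as a set per hashtag allows more than one
segmentation to be counted as correct for the same hashtag.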

We show that adding semantic information extracted from the hashtags leads
to an improvement in the overall tweet entity linking. For this, we performed
a comparative study on the output of the TAGME system when a tweet is given
with an un-segmented hashtag vs. when it is given with a segmented and
entity-linked hashtag.⁶ The case when the un-segmented hashtag is fed to
TAGME is considered

5 http://web-ngram.research.microsoft.com/info/break.html

6 For the segmented and entity-linked case, the linked entities in a hashtag
were replaced with the corresponding Wikipedia page titles.
                     Precision   Recall   F Score
TAGME (Baseline)     0.441       0.383    0.410
Our System           0.711       0.841    0.771

(a) Comparative Accuracies for Hashtags Entity Linking task.

                     Precision   Recall   F Score
TAGME (Baseline)     0.63        0.69     0.658
Our System + TAGME   0.732       0.91     0.811

(b) Comparative Accuracies for Overall Tweet Entity Linking task.

n      1       2       3       5       10      20
P@n    0.914   0.952   0.962   0.970   0.974   0.978

(c) Various P@n for Hashtag Segmentation task.

Table 1: Comparative Accuracies on the Microposts NEEL Dataset.⁷

as a baseline to show how much improvement can be attributed to our method of 
enriching the tweet with additional semantic information mined by segmenting 
and linking entities in a hashtag. 


Datasets. The lack of availability of a public dataset that suits our task
has been a major challenge. To the best of our knowledge, no publicly
available dataset contains tweets along with hashtags and the segmentation of
each hashtag into constituent entities appropriately linked to a Knowledge
Base. So, we approached this problem from two angles: 1) Manually Annotated
Dataset Generation (where the dataset is made public), 2) Synthetically
generated Dataset. The datasets are described in detail below.


1. Microposts NEEL Dataset: The Microposts NEEL Dataset contains over
3.5k tweets collected over a period from 15th July 2011 to 15th August 2011,
and is rich in event-annotated tweets. This dataset contains entities and the
corresponding linkages to DBpedia. The problem, however, is that this dataset
does not contain the segmentation of hashtags. We generate synthetic hashtags
by taking tweets and combining a random number of consecutive words with each
entity present within them. The remaining portion of the tweet that does not
get combined is considered to be the tweet context. If no entity is present
within the tweet, random words are combined to form the hashtag. This removes
the need for human intervention to segment and link hashtags, since we
already know the segmentation as well as the entities present within the
hashtag.
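
The synthetic hashtag construction can be sketched as below. The function
name, the `max_extra` bound on how many neighbouring words may be fused, and
the example tweet are hypothetical; the text only states that a random number
of consecutive words is combined with each entity.

```python
import random


def make_synthetic_hashtag(tokens, entity_span, max_extra=2, rng=random):
    """Fuse an entity with a random number of neighbouring words.

    tokens      -- the tweet as a list of words
    entity_span -- (start, end) token indices of the entity, end exclusive
    Returns (hashtag, gold_chunks, context_tokens).
    """
    start, end = entity_span
    left = rng.randint(0, min(max_extra, start))
    right = rng.randint(0, min(max_extra, len(tokens) - end))
    lo, hi = start - left, end + right
    chunks = tokens[lo:hi]                 # known gold segmentation
    hashtag = "#" + "".join(chunks)
    context = tokens[:lo] + tokens[hi:]    # remaining words form the context
    return hashtag, chunks, context


tokens = ["NSA", "vs", "Edward", "Snowden", "hearing", "today"]
hashtag, chunks, context = make_synthetic_hashtag(
    tokens, entity_span=(2, 4), rng=random.Random(7))
print(hashtag, chunks, context)
```

Because the hashtag is built from known tokens, the gold segmentation and the
linked entity come for free, which is the point made in the paragraph above.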


7 “TAGME (Baseline)” refers to the baseline evaluation where we give an
unsegmented hashtag to TAGME to annotate. “Our System + TAGME” refers to the
evaluation where we first do segmentation and entity linking on hashtags
using our system, and then feed them to TAGME to annotate either just the
hashtag (Table a) or the full tweet (Table b). This is also discussed under
“Evaluation Metrics” in subsection 5.1.
                     Precision   Recall   F Score
TAGME (Baseline)     0.398       0.465    0.429
Our System           0.731       0.921    0.815

(a) Comparative Accuracies for the Hashtag Entity Linking task.

                     Precision   Recall   F Score
TAGME (Baseline)     0.647       0.732    0.687
Our System + TAGME   0.748       0.943    0.834

(b) Comparative Accuracies for the Overall Tweet Entity Linking task.

n      1       2       3       5       10      20
P@n    0.873   0.917   0.943   0.958   0.965   0.967

(c) Various P@n for Hashtag Segmentation task.

Table 2: Comparative Accuracies on the Manually Annotated Stanford
Sentiment Analysis Dataset.


Our system achieved an accuracy (P@1) of 91.4% in segmenting the hashtag
correctly. The accuracy of Word Breaker in this case was 80.2%. This gap can
be attributed to a major difference between our system and Word Breaker:
Word Breaker is not context aware. It just takes an unspaced string and tries
to break it into words. Our method takes into account the relatedness between
the entities in a hashtag and the rest of the tweet content. Also, various
other hashtag-specific features like the Capitalisation Score play an
important part in improving the accuracy.

The comparative results of Entity Linking (in hashtags and overall), as well
as P@n at various values of n for the segmentation task, are contained in
Table 1. All the values are calculated by k-fold cross-validation with k = 5.


2. Manually Annotated Stanford Sentiment Analysis Dataset: To overcome the
limitation that a synthetically generated hashtag might not actually be
equivalent to a real world hashtag, we sampled around 1.2k tweets randomly
from the Stanford Sentiment Analysis Dataset,⁸ all of which contained one or
more hashtags. After this, we generated around 20 possible segmentations for
each hashtag by passing the hashtag and tweet through the Segmentations
Seeder Module. In the end we had around 21k rows, which were given to 3 human
annotators to annotate as 0 or 1 depending on whether or not a given
segmentation is correct (for a given hashtag) according to their judgement.

Determining the “correct” segmentation for a given hashtag is particularly 
challenging, as there may be many answers that are equally plausible. It has been 
long established that there exist style disagreements among various editorial 
content (“Homepage” vs “Home page”). There are also various new words that 
come into existence like “TweetDeck” which are brand or product names. So, 


8 http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip
Added Feature      P@1     Δ
Unigram            0.834   NA
+ Bigram           0.846   +1.2%
+ Context          0.855   +0.9%
+ Capitalisation   0.862   +0.7%
+ Relatedness      0.873   +1.1%

Table 3: Importance of each feature

our annotation guidelines in case of Stanford Sentiment Analysis Dataset allow 
for annotators to mark multiple segmentations as correct. 

The rows were labelled 0 if at least 2 annotators out of 3 agreed on the
label 0; similarly, the rows were labelled 1 if at least 2 out of 3
annotators agreed on the label 1. The labels are essentially
$Score_{Regression}$ as described in Section 4. The value of Fleiss' Kappa
(κ), which is a measure of inter-annotator agreement, comes out to be 0.89,
showing good agreement between annotators. This dataset is made public to
ease future research in this area.⁹
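
The majority-vote labelling and the agreement statistic can be reproduced
with the standard definition of Fleiss' kappa, sketched below. The toy
annotation rows are invented for illustration and are not taken from the
dataset.

```python
from collections import Counter


def majority_label(labels):
    """Label chosen by most annotators (3 binary votes always give one)."""
    return Counter(labels).most_common(1)[0][0]


def fleiss_kappa(rows, categories=(0, 1)):
    """Fleiss' kappa for rows of per-annotator labels, assuming every row
    has the same number of annotators."""
    n_items, n_raters = len(rows), len(rows[0])
    counts = [[list(row).count(c) for c in categories] for row in rows]
    # Mean observed agreement across items.
    p_bar = sum(
        (sum(c * c for c in item) - n_raters) / (n_raters * (n_raters - 1))
        for item in counts
    ) / n_items
    # Agreement expected by chance from the overall label proportions.
    p_e = sum(
        (sum(item[j] for item in counts) / (n_items * n_raters)) ** 2
        for j in range(len(categories))
    )
    return (p_bar - p_e) / (1 - p_e)


rows = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 1]]
print([majority_label(r) for r in rows], round(fleiss_kappa(rows), 3))
```

Kappa equals 1 under perfect agreement and is close to 0 when the annotators
agree no more often than chance would predict.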

Our system achieved a precision (P@1) of 87.3% in segmenting the hashtags
correctly. The P@1 score of Word Breaker in this case was 78.9%. The
difference in performance can again be attributed to the same reasons as in
the case of the NEEL Dataset.

The comparative results of Entity Linking (in hashtags and overall), as well
as P@n at various values of n for the task of segmentation, are contained in
Table 2. All the values are calculated by k-fold cross-validation with k = 5.

Results. We demonstrate the effectiveness of our technique by evaluating on
two different datasets. We also show how overall Entity Linking in tweets
improved when our system was used to segment the hashtag and link the
entities in the hashtag. We achieved an improvement of 36.1 percentage points
in F-Measure over the baseline in extracting semantics from hashtags in the
case of the NEEL Dataset. We further show that extracting semantics led to an
overall improvement in the Entity Linking of tweets. In the case of the NEEL
Dataset, we achieved an improvement of 15.3 percentage points in F-Measure
over the baseline in the overall tweet Entity Linking task, as can be seen in
Table 1. Similar results were obtained for the Annotated Stanford Sentiment
Analysis Dataset as well, as shown in Table 2. Further, we measured the
effectiveness of each feature in ranking the hashtag segmentations. The
results are summarized in Table 3.


6 Conclusions 

We have presented a context aware method to segment a hashtag, and link its 
constituent entities to a Knowledge Base (KB). An ensemble of various syntactic, 
as well as semantic features is used to learn a regression model that returns a 
ranked list of probable segmentations. This allows us to handle cases where 

9 Dataset: http://bit.ly/HashtagData 
multiple segmentations are acceptable (due to lack of context in cases where
tweets are extremely short) for the same hashtag, e.g. #Homesandgardens.

The proposed method of extracting more semantic information from hashtags 
can be beneficial to numerous tasks including, but not limited to sentiment 
analysis, improving search on social networks and microblogs, topic detection 
etc. 